Day 2 - Introduction to Data Analysis with R
Freie Universität Berlin - Theoretical Ecology
October 1, 2025

Data transformation is an important step in understanding the data and preparing it for further analysis.
We can use the tidyverse package dplyr for this:

library(dplyr)
# or
library(tidyverse)

With dplyr we can (among other things) filter rows, select columns, add new columns, summarize data, and combine tables.
All of the dplyr functions work similarly: they take a tibble as first argument and return a tibble.

Data set and_vertebrates with measurements of a trout and 2 salamander species in different forest sections.
year: observation year
section: CC (clear cut forest) or OG (old growth forest)
unittype: channel classification (C = Cascade, P = Pool, …)
species: Species measured
length_1_mm: body length [mm]
weight_g: body weight [g]
library(lterdatasampler)

vertebrates <- and_vertebrates |>
  select(year, section, unittype, species, length_1_mm, weight_g) |>
  filter(species != "Cascade torrent salamander")

vertebrates

filter()
picks rows based on their value
filter()
Filter only the trout species:
filter() goes through each row of the data and returns only those rows where the value for species is "Cutthroat trout".
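A minimal sketch of this filter. The tibble below is a made-up toy stand-in for vertebrates (its values are assumptions for illustration, not real measurements), so the example runs without lterdatasampler.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001, 2003, 2010),
  species     = c("Cutthroat trout", "Coastal giant salamander",
                  "Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41, 73, 50)
)

# Keep only rows where species equals "Cutthroat trout"
trout <- vertebrates |>
  filter(species == "Cutthroat trout")

trout
```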
filter()
You can also combine filters using logical operators (&, |, !):
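A short sketch of the three operators, again on a made-up toy tibble. The conditions are chosen only for illustration.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year    = c(1998, 2001, 2003, 2010),
  section = c("CC", "OG", "CC", "OG"),
  species = c("Cutthroat trout", "Coastal giant salamander",
              "Cutthroat trout", "Coastal giant salamander")
)

# & : both conditions must be TRUE
trout_after_2000 <- vertebrates |>
  filter(species == "Cutthroat trout" & year > 2000)

# | : at least one condition must be TRUE
cc_or_recent <- vertebrates |>
  filter(section == "CC" | year > 2005)

# ! : negates a condition
not_og <- vertebrates |>
  filter(!(section == "OG"))
```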
filter() + %in%
Use the %in% operator to filter rows based on multiple values, e.g. unittypes
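As a sketch, assuming we want to keep two channel types: the unittype values in this toy tibble are made up for illustration.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  unittype    = c("C", "P", "R", "C"),
  length_1_mm = c(58, 41, 73, 50)
)

# Keep rows whose unittype is one of the listed values
cascade_or_pool <- vertebrates |>
  filter(unittype %in% c("C", "P"))

cascade_or_pool
```

This is equivalent to unittype == "C" | unittype == "P", but stays readable when the list of values grows.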
filter() + is.na()
Filter only rows that don’t have a value for the weight
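A minimal sketch with one missing weight in a made-up toy tibble. The second call shows the negated version that drops the missing values instead.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  species  = c("Cutthroat trout", "Coastal giant salamander",
               "Cutthroat trout", "Coastal giant salamander"),
  weight_g = c(1.8, NA, 4.2, 9.6)
)

# Rows where the weight is missing
no_weight <- vertebrates |>
  filter(is.na(weight_g))

# Rows where the weight is present
has_weight <- vertebrates |>
  filter(!is.na(weight_g))
```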
filter() + between()
Filter rows where the value for year is between 2000 and 2005
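A sketch on made-up toy years. Note that between() includes both boundaries, so it is a shortcut for year >= 2000 & year <= 2005.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001, 2003, 2010),
  length_1_mm = c(58, 41, 73, 50)
)

# Keep rows with 2000 <= year <= 2005
early_2000s <- vertebrates |>
  filter(between(year, 2000, 2005))

early_2000s
```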
filter() helpers
These functions and operators help you filter your observations:
<, >, ==, …
&, |, !
%in% to filter multiple values
is.na() to filter missing values
between() to filter values that are between an upper and lower boundary
near() to compare floating points (use instead of == for doubles)

select()
picks columns based on their names
select()
Select the columns species, length_1_mm, and year
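A minimal sketch on a made-up toy tibble. The columns are returned in the order in which they are listed in select().

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001),
  section     = c("CC", "OG"),
  species     = c("Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41),
  weight_g    = c(1.8, NA)
)

# Keep only the three named columns
selected <- vertebrates |>
  select(species, length_1_mm, year)

selected
```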
select() + starts_with()
Select all columns that start with "s"
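As a sketch, on a made-up toy tibble with the same column names as vertebrates:

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001),
  section     = c("CC", "OG"),
  species     = c("Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41)
)

# Keep all columns whose name starts with "s"
s_columns <- vertebrates |>
  select(starts_with("s"))

s_columns
```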
select() helpers
starts_with() and ends_with(): variable names that start/end with a specific string
contains(): variable names that contain a specific string
matches(): variable names that match a regular expression
any_of() and all_of(): variables that are contained in a character vector

mutate()
Adds new columns to your data
mutate()
New columns can be added based on values from other columns
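A minimal sketch, assuming we want the body length in centimeters. The column name length_cm is hypothetical, and the toy values are made up.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  species     = c("Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41)
)

# Add a new column computed from an existing one
with_cm <- vertebrates |>
  mutate(length_cm = length_1_mm / 10)

with_cm
```

The existing columns are kept, and the new column is appended at the end.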
mutate() + case_when()
Use case_when() to add column values conditional on other columns.
case_when() can combine many cases into one.
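A sketch of a conditional column. The size classes and their thresholds are hypothetical and only serve as an illustration. Conditions are checked from top to bottom, and the first one that is TRUE determines the value.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  length_1_mm = c(58, 41, 73, 50)
)

# Hypothetical size classes based on body length
with_size <- vertebrates |>
  mutate(
    size_class = case_when(
      length_1_mm < 50 ~ "small",
      length_1_mm < 70 ~ "medium",
      TRUE             ~ "large"   # fallback for all remaining rows
    )
  )

with_size
```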

summarize()
summarizes data

summarize()
summarize() will collapse the data to a single row
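A minimal sketch on made-up toy values. The summary column names are hypothetical. na.rm = TRUE is needed for the weight because the toy data contain a missing value.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  length_1_mm = c(58, 41, 73, 50),
  weight_g    = c(1.8, NA, 4.2, 9.6)
)

# Collapse all rows into a single row of summary values
overall <- vertebrates |>
  summarize(
    mean_length = mean(length_1_mm),
    mean_weight = mean(weight_g, na.rm = TRUE),
    n           = n()
  )

overall
```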
summarize() by group
summarize() is much more useful in combination with the grouping argument .by
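A sketch of a grouped summary on made-up toy values, assuming dplyr 1.1.0 or newer (the version that introduced .by). The result has one row per group.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  species     = c("Cutthroat trout", "Coastal giant salamander",
                  "Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41, 73, 50)
)

# One summary row per species
by_species <- vertebrates |>
  summarize(
    mean_length = mean(length_1_mm),
    .by = species
  )

by_species
```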
With .by = c(species, unittype), the summary is calculated for each combination of species and unittype.

count()
Counts observations by group
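A minimal sketch on made-up toy values. count() adds a column n with the number of rows per group.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  section = c("CC", "OG", "CC", "OG"),
  species = c("Cutthroat trout", "Coastal giant salamander",
              "Cutthroat trout", "Coastal giant salamander")
)

# Number of observations per species
n_species <- vertebrates |>
  count(species)

# Number of observations per combination of species and section
n_species_section <- vertebrates |>
  count(species, section)
```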

|>
Combine multiple data operations into one command

|>
Data transformation often requires multiple operations in sequence.
The pipe operator |> helps to keep these operations clear and readable.
There is also the pipe %>% from the magrittr package.
In RStudio, turn on the native R pipe |> in Tools -> Global Options -> Code
|>
Let’s look at an example without pipe:
How could we make this more efficient?
Use one nested function without intermediate results:
But this gets complicated and error prone very quickly
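The two variants without a pipe can be sketched as follows. The specific operations (filter, mutate, select) and the toy values are assumptions for illustration, not the original slide example.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001, 2003, 2010),
  species     = c("Cutthroat trout", "Coastal giant salamander",
                  "Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41, 73, 50)
)

# Variant 1: step by step, storing every intermediate result
trout        <- filter(vertebrates, species == "Cutthroat trout")
trout_cm     <- mutate(trout, length_cm = length_1_mm / 10)
result_steps <- select(trout_cm, year, length_cm)

# Variant 2: one nested call, which has to be read from the inside out
result_nested <- select(
  mutate(
    filter(vertebrates, species == "Cutthroat trout"),
    length_cm = length_1_mm / 10
  ),
  year, length_cm
)

# Both variants give the same result
identical(result_steps, result_nested)
```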
|>
The pipe operator makes it very easy to combine multiple operations:
You can read from top to bottom and interpret the |> as an “and then do”.
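A sketch of a piped sequence. The operations and the toy values are assumptions for illustration. It requires R 4.1 or newer for the native pipe.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001, 2003, 2010),
  species     = c("Cutthroat trout", "Coastal giant salamander",
                  "Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41, 73, 50)
)

# Take the data, and then filter, and then mutate, and then select
result_pipe <- vertebrates |>
  filter(species == "Cutthroat trout") |>
  mutate(length_cm = length_1_mm / 10) |>
  select(year, length_cm)

result_pipe
```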
|>
But what is happening?
The pipe is “pushing” the result of one line into the first argument of the function from the next line.
Piping works perfectly with the tidyverse functions because they are designed to return a tibble and take a tibble as first argument.
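A small sketch to illustrate this: x |> f(y) is the same as f(x, y). The toy values are made up.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  year        = c(1998, 2001, 2003, 2010),
  length_1_mm = c(58, 41, 73, 50)
)

# With pipe: the tibble on the left becomes the first argument of filter()
piped <- vertebrates |> filter(year > 2000)

# Without pipe: the tibble is passed as first argument directly
direct <- filter(vertebrates, year > 2000)

identical(piped, direct)
```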
Tip
Use the keyboard shortcut Ctrl/Cmd + Shift + M to insert |>
|>
Piping also works well together with ggplot
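A sketch, assuming ggplot2 is installed. The plot type and the toy values are assumptions for illustration. Note that the data are piped into ggplot() with |>, but plot layers are still added with +.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  species     = c("Cutthroat trout", "Coastal giant salamander",
                  "Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41, 73, 50),
  weight_g    = c(1.8, NA, 4.2, 9.6)
)

# Transform the data first, then pipe the result into ggplot()
p <- vertebrates |>
  filter(species == "Cutthroat trout") |>
  ggplot(aes(x = length_1_mm, y = weight_g)) +
  geom_point()

p
```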

bind_rows()
Situation: Two (or more) tibbles with the same variables (column names), e.g. tbl_a and tbl_b
bind_rows()
Bind the rows together with bind_rows().
You can also add an ID column to indicate which row belonged to which table.
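A minimal sketch of both cases. The contents of tbl_a and tbl_b are made up, since the original tables are not shown here. Naming the inputs (a = ..., b = ...) is one option to get readable values in the ID column.

```r
library(dplyr)

# Two toy tibbles with the same columns (values are made up for illustration)
tbl_a <- tibble(species = "Cutthroat trout", length_1_mm = 58)
tbl_b <- tibble(species = "Coastal giant salamander", length_1_mm = 41)

# Stack the rows of both tibbles
combined <- bind_rows(tbl_a, tbl_b)

# Add an ID column that records which table each row came from
with_id <- bind_rows(a = tbl_a, b = tbl_b, .id = "id")

with_id
```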

left_join()
Situation: Two tables that share some but not all columns.

left_join()
Join the two tables by the common column species
left_join() means that the resulting tibble will contain all rows of vertebrates, but not necessarily all rows of species (in this case it does though).
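A sketch of such a join. Both tables are made-up toy data, and the column type in the species table is hypothetical. Unlike in the example described above, the toy species table contains one extra row to show that unmatched rows of the second table are dropped.

```r
library(dplyr)

# Toy stand-in for `vertebrates` (values are made up for illustration)
vertebrates <- tibble(
  species     = c("Cutthroat trout", "Coastal giant salamander",
                  "Cutthroat trout", "Coastal giant salamander"),
  length_1_mm = c(58, 41, 73, 50)
)

# Hypothetical second table with additional information per species
species <- tibble(
  species = c("Cutthroat trout", "Coastal giant salamander",
              "Cascade torrent salamander"),
  type    = c("fish", "amphibian", "amphibian")
)

# Keep all rows of vertebrates and add the matching columns from species
joined <- left_join(vertebrates, species, by = "species")

joined
```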
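
*_join() functions
Besides left_join(), dplyr also offers right_join(), inner_join(), and full_join(), which differ in which rows they keep.

Data transformation with dplyr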
All dplyr functions take a tibble as first argument and return a tibble.
filter(): %in%, is.na(), between(), near()
select(): starts_with(), ends_with(), contains(), matches(), any_of(), all_of()
arrange(): desc()
mutate(): case_when() for conditional values
summarize(): .by argument to summarize by group
count()
bind_rows(): .id = "id"; bind_cols() works similarly, just for columns
left_join()

Task (45 min)
Transform the penguin data set
Find the task description here
Selina Baldauf // Data transformation with dplyr